Abstract. Community structure is largely regarded as an intrinsic property of
complex real-world networks. However, recent studies reveal that networks
comprise even more sophisticated modules than classical cohesive communities.
More precisely, real-world networks can also be naturally partitioned according
to common patterns of connections between the nodes. Recently, a
propagation-based algorithm has been proposed for the detection of arbitrary
network modules. We here advance the latter with a more adequate community
modeling based on network clustering. The resulting algorithm is evaluated on
various synthetic benchmark networks and random graphs. It is shown to be
comparable to current state-of-the-art algorithms; however, in contrast to
other approaches, it does not require any prior knowledge of the true community
structure. To demonstrate its generality, we further employ the proposed
algorithm for community detection in different unipartite and bipartite
real-world networks, for generalized community detection and also for
predictive data clustering.

Keywords: link-density community, link-pattern community, propaga- 
tion, community detection, data clustering 



1 Introduction 

Over a decade of research in network analysis has revealed a number of common
properties of complex real-world networks [57, 13]. Community structure [18],
the occurrence of cohesive modules of nodes, is of particular interest as it
provides an insight into not only the structural organization but also the
functional behavior of various real-world systems [35, 2]. The analysis of
communities has thus been the focus of many recent endeavors [15, 38], and
community structure analysis is also considered one of the most prominent areas
of network science.



However, most of the past work was constrained to communities characterized by
a higher density of links, i.e., link-density communities [18] (Fig. 1(a)). In
contrast, recent studies reveal that networks comprise even more sophisticated
modules than classical cohesive communities [34, 1, 37, 52]. In particular,
real-world networks can also be naturally partitioned according to common
patterns of connections among nodes, into link-pattern communities [28, 34]
(Fig. 1(b)). Link-pattern communities can in fact be related to relevant
functional roles in various complex systems [37, 52]. Moreover, they also
provide a further comprehension of real-world network structure that is obscure
under classical frameworks. Note that link-density communities could be seen as
a special case of link-pattern communities, although several fundamental
differences exist [52]. In particular, link-pattern communities do not
correspond to densely connected groups of nodes, and generally are not even
connected. The latter actually implies low transitivity (clustering
coefficient [57]) for the nodes in link-pattern communities, which contradicts
the small-world phenomenon [57]. However, recent work suggests that the best
link-pattern communities might indeed emerge in parts of networks that exhibit
low values of clustering (e.g., technological networks), where the small-world
property does not generally hold [50].

Fig. 1. Comparison between (a) link-density communities in Zachary's karate
network [59] and (b) link-pattern communities in Davis's women network [9].

Recently, Šubelj and Bajec [52] have proposed a general propagation algorithm
that can reveal arbitrary network modules ranging from link-density to
link-pattern communities. Their algorithm does not require any prior knowledge
of the true structure, though they introduce a community parameter that models
the nature of each community according to a measure of network bottlenecks,
namely conductance [5]. We advance the latter by proposing a more adequate
modeling strategy based on the node clustering coefficient [57]. The resulting
algorithm is evaluated on various synthetic benchmark networks with planted
partition, on random graphs and also on resolution limit examples. It is shown
to be comparable to the current state-of-the-art, and the proposed strategy
also greatly improves on the approach of Šubelj and Bajec [52] (on these
networks). Furthermore, to demonstrate its generality, we also employ the
algorithm for community detection in different unipartite and bipartite
real-world networks, for generalized community detection and for predictive
data clustering.

The rest of the paper is structured as follows. In Section 2 we briefly review
relevant related work, with emphasis on the community detection literature.
Section 3 introduces the proposed algorithm, while the empirical evaluation
with formal discussion is done in Section 4. The performance on various
real-world examples is presented in Section 5, and conclusions are made in
Section 6.



2 Related work 



Despite the wealth of literature on classical communities in recent years [15],
only a small number of authors have considered more general link-pattern
communities. Nevertheless, authors have recently proposed different algorithms
based on stochastic blockmodels [1, 20], mixture models [34, 47], model
selection [28, 37], data clustering [27] and others [36]. However, in contrast
to the propagation algorithm proposed in this paper, and that in [52], all
other approaches require some prior knowledge of the true structure (e.g., the
number of communities), which seriously limits their use in practice. Note that
authors have also analyzed vertex similarity based on common patterns of
connections [4, 26], commonly referred to as structural equivalence, while some
of the research on classical communities also applies to link-pattern
counterparts [19, 45].

It ought to be mentioned that link-pattern communities are known as
blockmodels [55] in the social networks literature. These have been extensively
studied in the past; however, their main focus and employed formulation differ
from ours.

3 Model-based propagation 

Let the network be represented by an undirected graph $G(N, L)$, where $N$ is
the set of nodes of the graph and $L$ is the set of its links (edges).
Furthermore, let $w_{nm}$ be the weight of the link between nodes $n, m \in N$.
Moreover, let $c_n$ denote the community (label) of node $n \in N$, and let
$\Gamma_n$ be the set of its neighbors.

The proposed model-based propagation algorithm is, like the algorithm in [52],
based on the label propagation principle of Raghavan et al. [39]. In the
following, we thus first introduce the latter.

Label Propagation. The label propagation algorithm [39] (LPA) reveals
link-density communities by exploiting the following procedure. First, each
node $n \in N$ is labeled with a unique label (i.e., $c_n = l_n$). Then, at
each iteration, the node adopts the label shared by most of its neighbors (with
respect to link weights). Hence,

$$c_n = \operatorname*{argmax}_{l} \sum_{m \in \Gamma_n^l} w_{nm}, \tag{1}$$

where $\Gamma_n^l$ is the set of neighbors of node $n$ that share label $l$
(ties are broken uniformly at random¹). Due to the existence of many
intra-community links, relative to the number of inter-community links,
cohesive modules of nodes form a consensus on some label after a few
iterations. Thus, when the algorithm converges (i.e., a local equilibrium is
reached), disconnected sets of nodes sharing the same label are classified into
the same community. Due to the extremely fast structural inference of label
propagation, the algorithm exhibits near linear complexity and can easily scale
to networks with millions of nodes and links [53, 39].

Note that, to address issues with oscillations of labels in some networks
(e.g., bipartite networks), label updates in Eq. (1) occur in a random
order [55].

¹ When a node's current label is among the most frequent ones, the node retains
its label.
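For illustration, the label propagation procedure described above can be sketched in Python as follows. This is a minimal sketch and not the implementation evaluated in the paper: the function name `label_propagation`, the adjacency format (a dictionary mapping each node to a dictionary of neighbor weights) and the `max_iter` and `seed` parameters are hypothetical choices made only for this example.

```python
import random
from collections import defaultdict


def label_propagation(adj, max_iter=100, seed=0):
    """Sketch of basic label propagation, Eq. (1).

    adj maps node -> {neighbor: link weight} for an undirected network
    (an assumed input format, not one prescribed by the paper).
    """
    rng = random.Random(seed)
    labels = {n: n for n in adj}  # each node starts with a unique label
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)  # random update order
        changed = False
        for n in nodes:
            if not adj[n]:
                continue  # isolated nodes keep their own label
            # Total link weight towards each label in the neighborhood.
            score = defaultdict(float)
            for m, w in adj[n].items():
                score[labels[m]] += w
            best = max(score.values())
            candidates = [l for l, s in score.items() if s == best]
            if labels[n] in candidates:
                continue  # current label is among the most frequent: retain it
            labels[n] = rng.choice(candidates)  # ties broken uniformly at random
            changed = True
        if not changed:
            break  # local equilibrium reached
    return labels
```

On a network of two disconnected triangles, for example, each triangle settles on a single label within the first sweep, and the two labels necessarily differ because labels can only travel along links.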



General Propagation. Šubelj and Bajec [52] have argued that label propagation
cannot be directly applied to the detection of link-pattern communities, as the
bare nature of propagation requires connected (and cohesive) groups of nodes.
However, when one considers second-order neighborhoods, and propagates labels
through nodes' neighbors, link-pattern communities indeed correspond to
cohesive modules of nodes (see Fig. 1(b)). Based on the above, they have
proposed the general propagation algorithm [52] (GPA) that is presented in the
following.

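Let $\delta_c$ be a community parameter that models the nature of community
$c$, $\delta_c \in [0, 1]$. Assume $\delta_c$ equals 1 and 0 for link-density
and link-pattern communities, respectively (to be properly defined later).
Label propagation in Eq. (1) is then advanced into a general community
detection algorithm as

$$c_n = \operatorname*{argmax}_{l} \left( \delta_l \sum_{m \in \Gamma_n^l} w_{nm} b_m d_m + (1 - \delta_l) \sum_{m \in \Gamma_s^l \setminus \Gamma_n \mid s \in \Gamma_n} \tilde{w}_{nm} b_m \tilde{d}_m \right), \tag{2}$$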



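where $\tilde{w}_{nm}$ is the weight of the connection between nodes $n$ and
$m$ through their common neighbor $s$, and $s_n$ is the strength of node
$n \in N$ (i.e., $s_n = \sum_{m \in \Gamma_n} w_{nm}$). In the case of
link-density communities (left-hand side of Eq. (2)),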
the labels are propagated among the neighbors as before, whereas in the case
of link-pattern communities (right-hand side of Eq. (2)), the labels are
propagated through nodes' neighbors, i.e., between the nodes at distance two.
Thus, the algorithm can indeed reveal either link-density or link-pattern
communities, or different mixtures of both, when they are clearly depicted in
the network's topology. (Note that in [53] the algorithm was presented for
unweighted networks.)

Node balancers $b_n$ [51] and diffusion values $d_n$, $\tilde{d}_n$ [53, 52] in
Eq. (2) improve the algorithm's stability and accuracy, respectively. More
precisely, random label update orders (see above) severely hamper the
robustness of the approach, and consequently also the stability of the
identified community structure [51]. In particular, nodes that are updated at
the beginning exhibit higher propagation preferences than those that are
updated towards the end [51]. Thus, balancers $b_n$ are utilized to counteract
the randomness introduced by update orders: lower and higher preferences are
given to the nodes updated first and last, respectively.

Let $i_n$ denote a normalized position of node $n \in N$ in some random order,
$i_n \in (0, 1]$. Then, node balancers are set according to

$$b_n = \frac{1}{1 + e^{-\mu (i_n - \lambda)}}, \tag{3}$$

where $\lambda$ and $\mu$ are parameters of the algorithm. Intuitively, we fix
$\lambda$ to $\frac{1}{2}$, while $\mu$ is set to 2 based on some preliminary
experiments (see Section 4). Node balancers can also be modeled with a linear
function as $b_n = i_n$; however, the introduction of the above parameters
allows for distinct control over the algorithm. In particular, the analysis in
Section 4 reveals that increasing $\mu$ improves the stability of the
algorithm, although the computational time also increases. Note also that
setting $\mu$ to 0 yields classical label propagation where all $b_n$ are
equal.
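The effect of the two parameters in Eq. (3) can be seen in a short Python sketch. It is only an illustration under stated assumptions: the function name `node_balancers` and the `seed` argument are hypothetical, and the default $\lambda = \frac{1}{2}$ follows the reading of the text above.

```python
import math
import random


def node_balancers(nodes, mu=2.0, lam=0.5, seed=0):
    """Sketch of Eq. (3): logistic node balancers over a random update order.

    Returns the shuffled order and a dict node -> b_n. The names and the
    default lam=0.5 are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    order = list(nodes)
    rng.shuffle(order)
    count = len(order)
    balancers = {}
    for position, node in enumerate(order, start=1):
        i_n = position / count  # normalized position in (0, 1]
        balancers[node] = 1.0 / (1.0 + math.exp(-mu * (i_n - lam)))
    return order, balancers
```

With `mu=0.0` every balancer equals one half, so all nodes are treated alike as in classical label propagation, while larger values of `mu` give nodes updated late in the order an increasingly stronger preference than nodes updated early.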

To further boost the community detection strength of the algorithm, defensive
preservation of communities is employed through diffusion values $d_n$,
$\tilde{d}_n$ [53, 52]. Here higher diffusion values (propagation preferences)
are given to core nodes of each (current) community, while lower values are
given to their border nodes. This results in an immense ability to detect
communities, even when they are only weakly depicted in the network's
topology [53]. At each iteration, diffusion values are estimated by means of a
random walker utilized on each (current) community. Hence,

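$$d_n = \sum_{m \in \Gamma_n^{c_n}} \frac{d_m}{k_m^{c_n}} \tag{4}$$

and

$$\tilde{d}_n = \sum_{m \in \Gamma_s^{c_n} \setminus \Gamma_n \mid s \in \Gamma_n} \frac{\tilde{d}_m}{\sum_{s \in \Gamma_m} k_s^{c_n}}, \tag{5}$$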

where $k_n^{c_n}$ is the intra-community degree of node $n \in N$ (all $d_n$,
$\tilde{d}_n$ are initialized to $\frac{1}{|N|}$). Besides deriving an estimate
of the core and border of each community, the main rationale here is to
formulate propagation (diffusion) within each community, to estimate the
current state of label propagation, and then to adequately alter the dynamics
of the process. The analysis in Section 4 reveals that defensive preservation
of communities significantly improves the detection strength of the algorithm;
for further discussion and analysis see [53].

Despite the discussion above, the core of the algorithm is in fact represented
by a community modeling strategy implemented through parameters $\delta_c$.
Šubelj and Bajec [52] have proposed to measure the conductance [5] of each
community, to determine whether it better conforms with the link-density or
link-pattern regime. Conductance $\Phi(c)$ of community $c$ is defined as the
relative size of the corresponding network cut (the ratio of inter-community
links); thus it is a measure of network bottlenecks. Hence, at each iteration,
they simply set $\delta_c = 1 - \Phi(c)$, while all $\delta_c$ are initialized
to $\frac{1}{2}$. The main weakness of their strategy is that each community is
considered independently of the others. Thus, in the following, we propose a
more adequate community modeling strategy based on the properties of complex
real-world networks.
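The conductance-based strategy of [52] summarized above can be sketched in Python as follows. The sketch assumes that $\Phi(c)$ is computed as the weight of links leaving community $c$ divided by the total strength of its nodes, which is one common reading of "ratio of inter-community links"; the function name `gpa_community_parameters` and the input format are hypothetical.

```python
from collections import defaultdict


def gpa_community_parameters(adj, labels):
    """Sketch of delta_c = 1 - Phi(c) for every current community.

    adj maps node -> {neighbor: weight}; labels maps node -> community.
    Phi(c) is taken here as cut(c) / volume(c), an assumed definition.
    """
    cut = defaultdict(float)     # weight of links leaving each community
    volume = defaultdict(float)  # total strength of nodes in each community
    for n, neighbors in adj.items():
        for m, w in neighbors.items():
            volume[labels[n]] += w
            if labels[m] != labels[n]:
                cut[labels[n]] += w
    delta = {}
    for c in set(labels.values()):
        phi = cut[c] / volume[c] if volume[c] > 0 else 0.0
        delta[c] = 1.0 - phi
    return delta
```

For two triangles joined by a single link, each triangle obtains $\delta_c = 6/7$ (close to the link-density regime), whereas the two sides of a complete bipartite graph both obtain $\delta_c = 0$ (the link-pattern regime). Note that each value depends only on the community itself, which is the weakness pointed out above.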

Model-based Propagation. The community modeling strategy of Šubelj and
Bajec [52] considers merely the nature of each respective community, whereas
all other communities are disregarded. Although no proper empirical study
exists, in an ideal case, link-pattern communities would link to other
link-pattern communities rather than to other link-density communities. The
latter follows from the fact that the concerned links would otherwise decrease
the quality of the respective link-density community (i.e., make it a
link-pattern community). Thus, we propose a community model based on the
hypothesis that the neighbors' communities should be of the same type, either
link-density or link-pattern, as the concerned node's community. Hence,

$$\delta_c = \frac{1}{|N_c|} \sum_{m \in \Gamma_n \mid n \in N_c} \frac{\delta_{c_m}}{k_n}, \tag{6}$$

where $k_n$ is the degree of node $n \in N$ and $N_c$ is the set of nodes in
community $c$.
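The community modeling step can be illustrated with a short Python sketch. It follows the update as it appears in the community modeling step of Alg. 1, namely that $\delta_c$ becomes the average, over the nodes of community $c$, of the mean parameter of their neighbors' communities. The function name, the synchronous update (all new values computed from the old ones) and the handling of isolated nodes are assumptions made for this illustration.

```python
from collections import defaultdict


def update_community_parameters(adj, labels, delta):
    """Sketch of the community modeling update for all communities at once.

    adj maps node -> collection of neighbors, labels maps node -> community,
    delta maps community -> current parameter in [0, 1].
    """
    total = defaultdict(float)
    size = defaultdict(int)
    for n, neighbors in adj.items():
        c = labels[n]
        size[c] += 1
        if neighbors:
            # Mean parameter of the communities of n's neighbors.
            total[c] += sum(delta[labels[m]] for m in neighbors) / len(neighbors)
        else:
            total[c] += delta[c]  # assumption: isolated nodes keep the old value
    return {c: total[c] / size[c] for c in size}
```

For a path a1, a2, b1 with a1 and a2 in a community with parameter 1 and b1 in a community with parameter 0, the update gives 0.75 to the first community and 1 to the second, showing how the parameter of each community is pulled towards the type of the communities it links to.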

We also argue that an adequate initialization of community parameters
$\delta_c$ is of vital importance (exact results are omitted). Otherwise, the
algorithm can easily get trapped in some locally stable, probably suboptimal,
fixed point that is hard to escape from. However, Eq. (6) cannot be directly
employed at the beginning, as all nodes still reside in their own communities.
We thus refine the above hypothesis such that the node's neighbors should not
only reside in the same type of community, but in the same respective
community. The latter immediately implies that the neighbors of the nodes in
link-density communities should also link to each other, whereas the opposite
holds for the nodes in link-pattern communities. Hence, for each node
$n \in N$, one could initially set $\delta_{c_n}$ to $C_n$, where $C_n$ is the
node clustering coefficient [57], defined as the probability that two neighbors
of node $n$ also link to each other (network transitivity). It ought to be
mentioned that recent work suggests that transitivity, rather than homophily,
gives rise to the modular structure in real-world networks [17].

However, consider a node with a very high degree, i.e., a hub node. Hubs
commonly appear in link-density communities [19]; still, due to a large number
of links, they would only rarely experience high values of clustering
coefficient (the opposite would in fact imply a large clique). Also, as most
networks are disassortative by degree [31], hubs tend to link to low-degree
nodes that cannot provide for high clustering of the hub node [48]. Indeed, in
many real-world networks the node clustering coefficient roughly follows
$C_n \sim k_n^{-1}$ [56, 40, 48], where $k_n$ is the degree of node $n \in N$.
Hence, we model initial communities as (assume $C_n > 0$)

$$\delta_{c_n} = \begin{cases} 1 & \text{for } C_n > \alpha k_n^{-1} + \beta, \\ p & \text{otherwise,} \end{cases} \tag{7}$$

where $\alpha$ and $\beta$ are estimated from the network using ordinary least squares, and
p is a parameter. We set p to j based on some preliminary experiments. 
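A minimal Python sketch of the initialization in Eq. (7) is given below. It is an illustration under several assumptions rather than the authors' implementation: the function names are hypothetical, the network is unweighted and given as a dictionary of neighbor sets, the least-squares fit of $C_n$ against $k_n^{-1}$ uses only nodes with $C_n > 0$, and `p` must be supplied by the caller.

```python
def clustering_coefficients(adj):
    """Node clustering coefficient C_n for adj: node -> set of neighbors."""
    coefficients = {}
    for n, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            coefficients[n] = 0.0
            continue
        nb = list(neighbors)
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nb[j] in adj[nb[i]])
        coefficients[n] = 2.0 * links / (k * (k - 1))
    return coefficients


def fit_line(xs, ys):
    """Ordinary least squares fit y = alpha * x + beta."""
    count = len(xs)
    if count == 0:
        return 0.0, 0.0
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y  # all degrees equal: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x


def initial_community_parameters(adj, p):
    """Sketch of Eq. (7): 1 above the fitted curve alpha / k_n + beta, else p."""
    cc = clustering_coefficients(adj)
    fitted = [n for n in adj if cc[n] > 0]  # assumption: fit on C_n > 0 only
    alpha, beta = fit_line([1.0 / len(adj[n]) for n in fitted],
                           [cc[n] for n in fitted])
    delta = {}
    for n in adj:
        k = len(adj[n])
        above = k > 0 and cc[n] > alpha / k + beta
        delta[n] = 1.0 if above else p
    return delta
```

In a bipartite network no two neighbors of a node are linked, so every clustering coefficient is zero and every node starts with parameter `p`; nodes whose clustering exceeds what their degree alone would suggest start in the link-density regime.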

Eq. (7) and Eq. (6) define the proposed model-based propagation algorithm
(MPA), which is otherwise (almost) identical to the algorithm in [52] (see
Alg. 1). However, the evaluation on synthetic and real-world networks in
Section 4 and Section 5, respectively, reveals that the proposed approach
significantly outperforms that in [52]. For a thorough evaluation, we also
analyze two variations of the basic approach that fix all community parameters
$\delta_c$ to either 1 or 0. The approaches thus result in fully link-density
or link-pattern community detection algorithms, and are denoted MPA(D) and
MPA(P), respectively.

4 Evaluation and discussion 

In the following we evaluate the proposed algorithm on different synthetic
benchmark networks with planted partition, and also on random networks. The
results are assessed in terms of three different measures of community
significance, borrowed from the field of information theory and the community
detection literature.

Let $\mathcal{C}$ be a partition extracted by an algorithm and let
$\mathcal{P}$ be the known partition of the network (corresponding random
variables are $C$ and $P$, respectively).



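Algorithm 1 Model-based propagation algorithm (MPA)

Input: Graph $G(N, L)$ and parameters $\lambda$, $\mu$, $p$
Output: Communities $C$
{Community initialization.}
for $n \in N$ do
    $c_n \leftarrow l_n$ {Unique label.}
    $\delta_{c_n} \leftarrow$ {Model according to Eq. (7).}
    $d_n, \tilde{d}_n \leftarrow 1/|N|$
end for
{Model-based propagation.}
while not converged do
    shuffle($N$)
    for $n \in N$ do
        {General propagation.}
        $b_n \leftarrow 1/(1 + e^{-\mu (i_n - \lambda)})$
        $c_n \leftarrow \operatorname{argmax}_l \left( \delta_l \sum_{m \in \Gamma_n^l} w_{nm} b_m d_m + (1 - \delta_l) \sum_{m \in \Gamma_s^l \setminus \Gamma_n \mid s \in \Gamma_n} \tilde{w}_{nm} b_m \tilde{d}_m \right)$
        {Re-estimation.}
        $d_n \leftarrow \sum_{m \in \Gamma_n^{c_n}} d_m / k_m^{c_n}$ and $\tilde{d}_n \leftarrow \sum_{m \in \Gamma_s^{c_n} \setminus \Gamma_n \mid s \in \Gamma_n} \tilde{d}_m / \sum_{s \in \Gamma_m} k_s^{c_n}$
    end for
    for $c \in C$ do
        {Community modeling.}
        $\delta_c \leftarrow \frac{1}{|N_c|} \sum_{m \in \Gamma_n \mid n \in N_c} \delta_{c_m} / k_n$ {Omitted on first iteration.}
    end for
end while
return $C$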



The first measure, normalized mutual information [5] (NMI), has become a de
facto standard in the recent literature. NMI of $\mathcal{C}$ and $\mathcal{P}$
is defined as $\frac{2 I(C, P)}{H(C) + H(P)}$, where $I(C, P)$ is the mutual
information, and $H(C)$, $H(P)$ and $H(C|P)$ are standard and conditional
entropies. NMI of identical partitions equals 1, and is 0 for independent ones,
NMI $\in [0, 1]$. Second, we also consider normalized variation of
information [30, 22] (NVOI), which is a symmetric local measure that has the
properties of a distance in the space of partitions. NVOI of $\mathcal{C}$ and
$\mathcal{P}$ equals $\frac{H(C|P) + H(P|C)}{\log |N|}$; therefore, in contrast
to NMI, lower values represent better correlation between partitions,
NVOI $\in [0, 1]$. Last, for a better comprehension, we also adopt a more
intuitive measure, the fraction of correctly classified nodes [18] (FCC), that
is commonly adopted within the community detection literature. A node is
considered correctly classified if it resides in the same community as at
least one half of the nodes in its true community. Again, FCC $\in [0, 1]$.
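The two information-theoretic measures can be computed directly from label sequences, as in the following Python sketch. It assumes the standard forms $\mathrm{NMI} = 2 I(C, P) / (H(C) + H(P))$ and $\mathrm{NVOI} = (H(C|P) + H(P|C)) / \log |N|$; the function names and the convention of returning NMI equal to 1 when both partitions are trivial are choices made only for this example.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy (natural logarithm) of a distribution given by counts."""
    return -sum(c / total * math.log(c / total) for c in counts)


def nmi_and_nvoi(found, known):
    """Sketch of NMI and NVOI for two partitions given as label sequences.

    found[i] and known[i] are the revealed and the true community of node i.
    """
    n = len(found)
    h_c = entropy(Counter(found).values(), n)
    h_p = entropy(Counter(known).values(), n)
    h_joint = entropy(Counter(zip(found, known)).values(), n)
    mutual = h_c + h_p - h_joint                  # I(C, P)
    nmi = 2.0 * mutual / (h_c + h_p) if h_c + h_p > 0 else 1.0
    voi = (h_joint - h_p) + (h_joint - h_c)       # H(C|P) + H(P|C)
    nvoi = voi / math.log(n) if n > 1 else 0.0
    return nmi, nvoi
```

Two identical partitions give NMI of 1 and NVOI of 0 regardless of how the communities are named, while the independent partitions `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give NMI of 0 and NVOI of 1.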

Community detection algorithms introduced in Section 3 are compared against a
greedy agglomerative optimization [32, 7] of modularity $Q$ [33] (denoted
MO(G)), a classical link-density community detection algorithm, and a mixture
model with expectation-maximization [10] proposed by Newman and Leicht [34]
(denoted MM(EM)). The latter can detect arbitrary network modules and is
currently among the state-of-the-art approaches for generalized community
detection [34, 37]. However, it demands the correct number of communities to be
known ahead of time, which gives the algorithm a significant advantage over the
others [23]. For simplicity, we limit the number of iterations to 100 for all
algorithms.



GN2 Benchmark. The algorithms are first applied to a class of benchmark
networks [37] that is in fact a generalization of a classical benchmark
proposed by Girvan and Newman [18]. Networks comprise four communities of 32
nodes, where two communities correspond to classical link-density modules,
while the other two form a bipartite structure of link-pattern communities.
Average degree is fixed to 16, while the community structure is controlled by a
mixing parameter $\theta$, $\theta \in [0, 1]$. When $\theta$ is 0, all links
are set according to the designed community structure, while for $\theta$ equal
to 1, the networks are completely random.

The results are shown in Fig. 2. Observe that for small values of $\theta$ only
MPA and MPA(P) can accurately reveal the planted structure in these networks.
However, when $\theta$ increases, the performance of MPA is similar to that of
a classical community detection algorithm (e.g., MO(G) or MPA(D)). MM(EM) can
detect communities to some extent until $\theta < \frac{1}{3}$ (dashed lines in
Figs. 2, 3), that is, while for the nodes within link-density communities there
are at least twice as many links that conform with the planted structure as
randomly placed links. Note also that twice as many links are needed to define
a link-pattern community, compared to a respective link-density community,
which would yield the same threshold at $\theta = \frac{1}{5}$ for these
networks (solid lines in Figs. 2, 3). Thus, MPA accurately extracts
planted link-density and link-pattern communities in these networks, as long as 
they are clearly depicted in the network's topology. Note also that community 
modeling strategy within MPA seems more adequate than that of GPA. 



SB Benchmark. The GN2 benchmark provides a rather unrealistic testbed due to
homogeneous degree and community size distributions. We address the latter by
proposing a class of simple benchmark networks with heterogeneous community
sizes. Networks comprise three communities of 16, 32 and 24 nodes, respectively
(see network in Fig. 8(a)). The first two again form a bipartite structure of
link-pattern communities, while the third community corresponds to a classical
cohesive module. Links are placed according to the designed community structure
such that the average degree of the nodes in the first and third community is
fixed to 16. The latter implies an average degree of 8 for the nodes in the
second community. Furthermore, we also add some number of links uniformly at
random for each node, denoted node confusion degree $k$, $k > 0$.

The results appear in Fig. 3. The performance of the algorithms is rather
similar to that on the GN2 benchmark (note different scales in Figs. 2, 3).
Only MPA can accurately reveal the planted structure for small values of $k$,
while the model within GPA again seems to fail. Observe that MM(EM) can extract
communities equally well, even when $k$ equals 16; at that point only
$\frac{1}{3}$ of the links for the nodes in the second community still agree
with the intrinsic structure (8 designed links against 16 random ones), thus
the communities are only marginally defined. The latter clearly demonstrates
that knowing the exact number of communities indeed presents a significant
advantage.



Fig. 2. Analysis on GN2 benchmark networks [37] subject to (a) NMI and (b)
NVOI, plotted against the mixing parameter. The values are estimates over 100
network realizations, while error bars show standard error of the mean.

Fig. 3. Analysis on SB benchmark networks subject to (a) NMI and (b) NVOI,
plotted against the average confusion degree. The values are estimates over 100
network realizations, while error bars show standard error of the mean.

Fig. 4. Analysis on LFR benchmark networks subject to (a) NMI and (b) NVOI,
plotted against the mixing parameter. The values are estimates over 10 network
realizations, while error bars show standard error of the mean. To ensure
convergence,
fi is set to |. 
Fig. 5. Analysis on HN benchmark networks [6]: (a) dendrogram and (b) analysis
subject to NMI, plotted against the number of communities. The values are
estimates over 1000
network realizations, while missing values of p in the legend equal ^. See also text. 



LFR Benchmark. To enable easier comparison with previous literature on
community detection, we also apply the algorithms to a class of standard
benchmark networks with scale-free degree and community size distributions
proposed by Lancichinetti et al. [25]. The size of the networks is set to 1000
nodes, while community sizes range between 10 and 50 nodes. Note that all
communities here correspond to a link-density regime. As before, the quality of
the planted structure is controlled by a mixing parameter $\theta$,
$\theta \in [0, 1]$. For comparison, we also analyze two variations of MPA that
do not employ either balanced propagation or defensive preservation of
communities (denoted MPA-D and MPA-B, respectively).

Results in Fig. 4 show that MPA most accurately reveals the planted structures
in these networks, while it also significantly outperforms the other
generalized community detection algorithm MM(EM). Observe also that defensive
preservation of communities greatly improves the algorithm's community
detection strength. Comparing the results with an analysis of over ten
state-of-the-art approaches for classical community detection conducted
in [24], we conclude that, at least on these networks, MPA performs similarly
to the best algorithms analyzed there. These are the hierarchical modularity
optimization of Blondel et al. [3], the model selection technique of Rosvall
and Bergstrom [33], the spectral algorithm proposed by Donetti and Muñoz [11]
and the multi-resolution spin model of Ronhovde and Nussinov [43].



HN Benchmark. Next, we also analyze the proposed algorithm on a class of
benchmark networks with a hierarchical structure [6]. In particular, networks
are constructed according to the community dendrogram in Fig. 5(a), where
leaves correspond to eight modules of 16 nodes, while each node $d$ of the
dendrogram is also associated with a probability $p_d$, $p_d \in [0, 1]$. The
nodes of the network are linked with the probability associated with their
lowest common ancestor in the community dendrogram. Varying the values of $p_d$
can produce (almost) arbitrary hierarchical structure of either link-density or
link-pattern communities. However, for simplicity, we associate each level of
the dendrogram nodes with the same probability $p_d$. Thus, denote
$p = [p_1, p_2, p_3, p_4]$ to be the vector of respective probabilities for the
nodes from the lowest to the highest level of the hierarchy, respectively.

Fig. 6. Analysis on (a) random graphs [12], plotted against the average degree,
and (b) resolution limit test networks [16], plotted against the number of
planted communities. The values are estimates over 100 network realizations,
while error bars are smaller than the symbol sizes.

The performance of MPA on five realizations of the above benchmark can be seen
in Fig. 5. Values of NMI were estimated such that each revealed partition was
compared against (only) three intrinsic community structures, represented by
dashed lines in Fig. 5(a), and the best correspondence was reported. (Note that
the results are thus actually rather pessimistic.) Observe that MPA can
accurately reveal the planted structure in all five cases (see legend in
Fig. 5(b)), which further confirms the adequacy of the proposed community
model. More precisely, in the first case, the intrinsic network structure
results in a hierarchy of link-pattern communities, whereas in the second case,
communities are in fact defined on two levels of the designed hierarchy. In
each of the last three cases, the communities correspond to a single level of
the hierarchy. Thus, MPA can indeed be employed for the detection of arbitrary
community structure.

Random Graphs. We also apply the algorithms to Erdős-Rényi random graphs [12]
that presumably have no community structure. We fix the number of nodes to 128
and vary the average degree from 4 to 32. The results are shown in Fig. 6(a).
Note that, in contrast to MO(G), neither MPA nor GPA reports any community
structure for these networks: all nodes are classified into a single community.

Resolution Limit. We further analyze the algorithms on resolution limit test
networks [16], where the resolution limit [16] denotes the existence of an
intrinsic scale within the algorithm, below which the communities are no longer
recognized. The networks consist of cliques with 4 nodes that are linked into a
ring. Results in Fig. 6(b) reveal that neither MPA nor GPA is seriously
affected by the resolution limit issues, whereas the opposite holds for MO(G).
Although some fluctuations are indeed observed for MPA, these are not as severe
as in the case of modularity [16].
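The resolution limit test networks described above are simple to construct, as the following Python sketch shows. The text only states that cliques of 4 nodes are linked into a ring; the node numbering, the choice of which node of each clique carries the ring link and the function name `ring_of_cliques` are assumptions of this illustration.

```python
def ring_of_cliques(num_cliques, clique_size=4):
    """Sketch of a resolution limit test network: cliques linked into a ring.

    Returns an adjacency dict node -> set of neighbors. Nodes are numbered
    consecutively, clique by clique (an arbitrary convention).
    """
    adj = {}
    for c in range(num_cliques):
        members = [c * clique_size + i for i in range(clique_size)]
        for u in members:
            adj[u] = {v for v in members if v != u}  # complete subgraph
    for c in range(num_cliques):
        # Link the last node of clique c to the first node of the next clique.
        u = c * clique_size + clique_size - 1
        v = ((c + 1) % num_cliques) * clique_size
        adj[u].add(v)
        adj[v].add(u)
    return adj
```

A ring of five such cliques has 20 nodes and 35 links (six within each clique plus five ring links), and the planted partition simply assigns each clique to its own community.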
Fig. 7. Analysis of stability and complexity of MPA, plotted against the
stability parameter, on three real-world networks from Table 1: (a) karate
network [59], (b) football network [18] and (c) women network [9]. The values
are estimates over at least 100 runs (note different scales).



Algorithm Stability. As previously discussed, random label update orders
severely hamper the stability of label propagation, and thus also the
robustness of the revealed community structure [54]. Hence, balanced
propagation [51] is employed, yet this introduces two parameters $\lambda$ and
$\mu$ (Section 3). The value of $\lambda$ is intuitively fixed to $\frac{1}{2}$
(see Eq. (3)), while parameter $\mu$ in fact controls the stability of the
algorithm. In Fig. 7 we analyze MPA with respect to the stability parameter
$\mu$ on three real-world networks from Table 1. Plots show the pairwise
distance between revealed community structures, and also the number of
iterations needed for the algorithm to converge (note different scales).
Observe that increasing $\mu$ improves the stability of MPA in all three
networks; however, the number of iterations also increases. Furthermore, as one
would expect, when $\mu$ exceeds a certain threshold, the pairwise distance
between community structures notably increases (some nodes get completely
disregarded due to propagation preferences close to 0; see Eq. (3)), while the
number of iterations can also increase substantially (see Fig. 7(b)). The
transition occurs at around $\mu \approx 4$ for these networks; thus, for the
analysis throughout the paper, $\mu$ is set to 2 (if not stated otherwise). It
ought to be mentioned that balanced propagation can also improve the community
detection strength of the basic label propagation [51] (see above).
Fig. 8. Analysis of community modeling strategy of MPA on SB benchmark
networks: (a) SB benchmark network and (b) analysis of the community model in
MPA.
Node shapes represent planted network communities and are consistent among figures. 
The values are estimates over 100 network realizations, while, for an adequate analysis, 
we set p to and increase fi (Section |3|. See also text. 



Community Modeling. For a comprehensive analysis, we also directly analyze the
proposed community modeling strategy of MPA on SB benchmark networks with
confusion degree set to 2 (see above). In particular, we measure the average
value of community parameter $\delta_c$ (Section 3) for the nodes in each of
the planted network communities. Results in Fig. 8 show that, at least for
these networks, values of community parameter $\delta_c$ clearly distinguish
between the link-density and link-pattern regimes: the average $\delta_c$ is
close to 1 and 0 for the nodes in link-density and link-pattern communities,
respectively. Note also that, due to lower average degree, values of $\delta_c$
are initially higher for the larger of the two link-pattern communities.
However, before the algorithm converges (the average number of iterations is
shown by a horizontal line in Fig. 8(b)), the community model in MPA infers the
same average value of $\delta_c$ for both link-pattern communities. Note also
that GPA cannot properly model communities planted in these networks (Fig. 3).

Computational Complexity. Basic label propagation and its advances exhibit
near linear time complexity O(|L|) [39], where |L| is the number of links in the
network. In particular, the exact complexity was estimated at around O(|L|^1.2) [53].
Similarly, the proposed model-based propagation MPA exhibits complexity near
O(k|L|), where k is the average degree in the network. Although a thorough
empirical analysis is out of scope of this paper, based on the results in [53]
(and above), we estimate that MPA should scale up to networks with a million
links, which can be processed on a standard desktop computer within an hour.
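
To illustrate where the near linear cost comes from, the following is a minimal sketch of basic asynchronous label propagation only; it is not the MPA or GPA implementation, and the function name, the `max_sweeps` bound and the fixed `seed` are assumptions made for this illustration. Each sweep inspects every link twice (once from each endpoint), so a single sweep costs on the order of |L| operations.

```python
import random
from collections import Counter


def label_propagation(adjacency, max_sweeps=100, seed=0):
    """Basic asynchronous label propagation on an undirected graph.

    adjacency: dict mapping each node to a list of its neighbors.
    Returns a dict mapping each node to its community label.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adjacency}  # each node starts alone
    nodes = list(adjacency)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)  # random update order in every sweep
        changed = False
        for node in nodes:
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            # One unit of work per incident link.
            counts = Counter(labels[nb] for nb in adjacency[node])
            top = max(counts.values())
            best = [lab for lab, c in counts.items() if c == top]
            if labels[node] not in best:
                labels[node] = rng.choice(best)  # ties are broken at random
                changed = True
        if not changed:
            break  # every node already holds a most frequent label
    return labels


# Two disjoint triangles: labels cannot cross between components.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
labels = label_propagation(graph)
communities = {}
for node, lab in labels.items():
    communities.setdefault(lab, set()).add(node)
print(sorted(sorted(c) for c in communities.values()))
```

The number of sweeps needed for convergence is what pushes the empirical cost slightly above linear; the sketch simply caps it with `max_sweeps`.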

Final Remarks. The above analysis on different benchmark networks and random
graphs indeed confirms that MPA can reveal arbitrary composites of either
link-density or link-pattern communities, as long as they are clearly depicted in
the network's topology. Moreover, the proposed community modeling strategy
also seems more adequate than the approach proposed by Šubelj and Bajec [52]



Table 1. Analysis on real-world networks subject to NMI estimated over 1000 runs (10
runs for software networks). The corporate network is reduced to the largest component,
while the known partition is also limited to 86 corporate nodes; we thus set μ to |.



Network                           Nodes  Links  Comm.  MO(G)   GPA     MM(EM)  MPA
Zachary's karate club [59]           34     78      2  0.6925  0.7155  0.7870  0.8949
American college football [18]      115    616     12  0.7547  0.8769  0.8049  0.8919
Davis's southern women [5]           32     89      4  -       0.7338  0.8332  0.8084
Scottish corpor. interlocks [46]    217    348      8  -       0.6634  0.5988  0.6411
Java (org namespace) [49]           709   3571     47  0.5029  0.5190  -       0.5187
Java (javax namespace) [49]        1595   5287    107  0.7048  0.7369  -       0.7386



for all networks considered. Further note that, although MPA is mostly
outperformed by MM(EM) on the benchmarks above, this should be attributed
to the fact that MM(EM) is advised about the number of communities. However,
this number currently cannot be properly estimated for large networks [23].
Moreover, MPA also performs significantly better on real-world networks in
Section 5.

5 Real-world examples 

In the following we further employ the proposed algorithm for community de- 
tection in different unipartite and bipartite social networks — classical and fully 
link-pattern community detection, respectively — and also for a generalized com- 
munity detection and predictive data clustering. All of the networks considered 
below are regarded as unweighted and undirected. 

Community Detection. We first consider two classical networks for community
detection (a network of social interactions between members of a karate club
analyzed by Zachary [59], and a network of interplays in the 2000 NCAA American
football schedule proposed in [18]) and two well-known bipartite networks (a
network of social collaborations between women in Natchez, Mississippi collected
by Davis [5], and a network of corporate interlocks in Scotland between 1904 and
1905 introduced in [46]); see Table 1. All these networks have known natural
community structures that result from earlier studies (see also Fig. 1).

Propagation algorithms, MPA and GPA, most accurately reveal the true
community structure for most of these networks (Table 1), whereas the community
modeling strategy of MPA again seems more adequate than that of GPA. Note
also that most values of NMI for MPA in Table 1 are considerably high.
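
For readers who wish to reproduce such comparisons, the sketch below computes NMI between a known and a revealed partition. It assumes the normalization 2I(A;B)/(H(A)+H(B)) that is common in the community detection literature; the exact variant used for the tables is not restated here, and the function name is hypothetical.

```python
import math
from collections import Counter


def nmi(partition_a, partition_b):
    """Normalized mutual information of two partitions of the same nodes.

    Each partition is a sequence of community labels, one per node.
    Assumed normalization: 2 * I(A;B) / (H(A) + H(B)).
    """
    n = len(partition_a)
    if n == 0 or n != len(partition_b):
        raise ValueError("partitions must be non-empty and of equal length")
    count_a = Counter(partition_a)
    count_b = Counter(partition_b)
    joint = Counter(zip(partition_a, partition_b))
    # Mutual information from the joint label counts.
    mutual = sum(
        (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in joint.items()
    )
    entropy_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if entropy_a + entropy_b == 0.0:
        return 1.0  # both partitions consist of a single community
    return 2.0 * mutual / (entropy_a + entropy_b)


known = [0, 0, 0, 1, 1, 1]
revealed = [5, 5, 5, 7, 7, 7]  # same grouping, different label names
print(nmi(known, revealed))
```

Since the measure depends only on how nodes are grouped, relabeling the communities does not change the value.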

Next, we also consider two software class dependency networks representing the
org and javax namespaces of the Java language compiled in [33]. Here, the natural
community structure should coincide with the respective software packages [49],
while these are expected to conform with link-density and also link-pattern


(a) Network adjacency matrix    (b) Blockmodel (reordered adj. mat.)

Fig. 9. Community structure of Java software network revealed with MPA (b).
Only communities with more than 24 nodes are shown; still, the structure
contains 1020 nodes and 4184 links. Link colors correspond to high-level software
packages (javax.swing, javax.management, javax.xml, javax.print, javax.naming,
javax.lang and other), while each dot was enlarged five times for better visibility.



regime [55]. Again, propagation algorithms most accurately extract the true
network structures (Table 1), whereas MM(EM) fails completely. In Fig. 9 we also
show the community structure of the javax network revealed with MPA that obtains
NMI = 0.7431. Observe how the communities largely agree with high-level software
packages, whereas the majority of the links in the network are consistent with
the revealed structure. Interestingly, some packages contain mainly link-pattern
communities (e.g., javax.swing), while others are composed of only link-density
communities (e.g., javax.xml).

Data Clustering. To apply a community detection algorithm for data clustering,
the respective dataset must first be represented by a network using some measure
of similarity. Following [22], we adopt the inversed Chebyshev distance, with
initial [0, 1]-normalization. In order to obtain a sparse network, links must also be
thresholded accordingly. (For simplicity, we consider only unweighted versions
of the algorithms.) Note that the resulting network thus commonly decomposes
into several connected components; however, a community detection algorithm can
still be employed to further partition these components (see Table 2).
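
The construction above can be sketched as follows, under stated assumptions: the "inversed" distance is taken to be 1 - d, which lies in [0, 1] once every attribute is normalized, and the `threshold` parameter and function name are hypothetical. The precise similarity and threshold used for the experiments follow [22] and are not reproduced here.

```python
def similarity_network(data, threshold):
    """Build an unweighted, undirected network from numeric data items.

    data: list of equally long lists of numbers, one list per item.
    threshold: minimum similarity for a link (hypothetical parameter).
    Returns a list of links (i, j) with i < j.
    """
    n, dims = len(data), len(data[0])
    lows = [min(row[d] for row in data) for d in range(dims)]
    highs = [max(row[d] for row in data) for d in range(dims)]

    def scale(value, low, high):
        # [0, 1]-normalization; constant attributes are mapped to 0.
        return (value - low) / (high - low) if high > low else 0.0

    normalized = [
        [scale(row[d], lows[d], highs[d]) for d in range(dims)] for row in data
    ]
    links = []
    for i in range(n):
        for j in range(i + 1, n):
            # Chebyshev distance: largest difference over all attributes.
            distance = max(
                abs(a - b) for a, b in zip(normalized[i], normalized[j])
            )
            similarity = 1.0 - distance  # assumed form of the inversion
            if similarity >= threshold:
                links.append((i, j))
    return links


items = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
links = similarity_network(items, threshold=0.5)
print(links)
```

In this toy example the two pairs of nearby items end up in separate connected components, which mirrors the decomposition noted above.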

We employ community detection to predict class variables of two famous
datasets: the Iris plants dataset introduced by Fisher [14], and the Ecoli protein
localization sites dataset [21]. For comparison, in Table 2 we also report the
results for a classical partitional clustering algorithm K-Means (denoted KM).
Observe that MPA obtains extremely promising results on these datasets, while it
also significantly outperforms MM(EM) and KM that are both advised about the


Table 2. Analysis of data clustering on two real-world datasets subject to NMI and
FCC, respectively (estimated over 100 runs).

Dataset                     Items  Classes  Links  Comp.  Measure  KM      MM(EM)  MPA
Iris plants dataset [14]      150        3   2405      2  NMI      0.8234  0.8113  0.8264
                                                          FCC      0.8227  0.8196  0.8983
Ecoli protein dataset [21]    336        8  14685      4  NMI      0.5835  0.0797  0.6251
                                                          FCC      0.2530  0.0277  0.4164



number of communities. Still, the results could be further improved in various
ways. (Note that the reason for the low NMI of MM(EM) on the Ecoli dataset is
not entirely evident.)

6 Conclusions 

The paper proposes an enhanced community modeling strategy for a recently
introduced general propagation algorithm [55]. The resulting algorithm can
detect arbitrary network modules, ranging from link-density communities to
link-pattern communities, while, in contrast to most other approaches, it requires
no a priori knowledge about the true structure (e.g., the number of communities).
The algorithm was evaluated on various benchmark networks with planted
partition, on random graphs and resolution limit test networks, where it is shown
to be at least comparable to the current state of the art. Moreover, to demonstrate
its generality, the algorithm was also employed for community detection in
different unipartite and bipartite social networks, for generalized community
detection and data clustering. The results imply that the proposed community
model provides an adequate approximation of the real-world network structure,
although recent work suggests that network clustering and degree mixing could be
even further utilized within the model [48, 17, 55, 41]. The latter will be
considered for future work. (For supporting website see
http://lovro.lpt.fri.uni-lj.si/.)